import pandas as pd
import numpy as np

df = pd.read_csv('bank_marketing.csv', usecols=lambda column: column != "index")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   client_id                   41188 non-null  int64  
 1   age                         41188 non-null  int64  
 2   job                         41188 non-null  object 
 3   marital                     41188 non-null  object 
 4   education                   41188 non-null  object 
 5   credit_default              41188 non-null  object 
 6   mortgage                    41188 non-null  object 
 7   month                       41188 non-null  object 
 8   day                         41188 non-null  int64  
 9   contact_duration            41188 non-null  int64  
 10  number_contacts             41188 non-null  int64  
 11  previous_campaign_contacts  41188 non-null  int64  
 12  previous_outcome            41188 non-null  object 
 13  cons_price_idx              41188 non-null  float64
 14  euribor_three_months        41188 non-null  float64
 15  campaign_outcome            41188 non-null  object 
dtypes: float64(2), int64(6), object(8)
memory usage: 5.0+ MB

# Split into the three tables
client = df[["client_id", "age", "job", "marital", 
                    "education", "credit_default", "mortgage"]]
campaign = df[["client_id", "number_contacts", "month", "day", 
               "contact_duration", "previous_campaign_contacts", "previous_outcome", "campaign_outcome"]]
economics = df[["client_id", "cons_price_idx", "euribor_three_months"]]

client

client_columns = ['client_id', 'age', 'job', 'marital', 'education', 'credit_default', 'mortgage']

for col in client_columns:
    print(f"Value counts for column: {col}\n")
    print(client[col].value_counts())  # Correct way to access column
    print("\n" + "-"*50 + "\n")  # Separator for readability

Value counts for column: client_id

client_id
0        1
27512    1
27454    1
27455    1
27456    1
        ..
13730    1
13731    1
13732    1
13733    1
41187    1
Name: count, Length: 41188, dtype: int64

--------------------------------------------------

Value counts for column: age

age
31    1947
32    1846
33    1833
36    1780
35    1759
      ... 
89       2
91       2
94       1
87       1
95       1
Name: count, Length: 78, dtype: int64

--------------------------------------------------

Value counts for column: job

job
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: count, dtype: int64

--------------------------------------------------

Value counts for column: marital

marital
married     24928
single      11568
divorced     4612
unknown        80
Name: count, dtype: int64

--------------------------------------------------

Value counts for column: education

education
university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
unknown                 1731
illiterate                18
Name: count, dtype: int64

--------------------------------------------------

Value counts for column: credit_default

credit_default
no         32588
unknown     8597
yes            3
Name: count, dtype: int64

--------------------------------------------------

Value counts for column: mortgage

mortgage
yes        21576
no         18622
unknown      990
Name: count, dtype: int64

--------------------------------------------------

# Cleaning education column
client = client.copy()  # This removes ambiguity
client["education"] = client["education"].str.replace(".", "_", regex=False)
client["education"] = client["education"].replace("unknown", np.NaN)

# Cleaning job column
client["job"] = client["job"].str.replace(".", "_")

# Clean and convert client columns to bool data type
for col in ["credit_default", "mortgage"]:
  client[col] = client[col].map({"yes": 1,
                                 "no": 0,
                                 "unknown": 0})
  client[col] = client[col].astype(bool)

campaign

# create a campaign copy to avoid ambiguity 
campaign = campaign.copy()

# Change campaign_outcome to binary values
campaign["campaign_outcome"] = campaign["campaign_outcome"].map({"yes": 1, "no": 0})

# Convert previous_outcome to binary values
campaign["previous_outcome"] = campaign["previous_outcome"].map({
    "success": 1,
    "failure": 0,
    "nonexistent": 0
})

# Clean and convert outcome columns to bool
for col in ["campaign_outcome", "previous_outcome"]:
  campaign[col] = campaign[col].astype(bool)

# Adding column 'year' in campaign data
campaign['year']= '2022'

# check the type of month and day
campaign[['day', 'month']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   day     41188 non-null  int64 
 1   month   41188 non-null  object
dtypes: int64(1), object(1)
memory usage: 643.7+ KB

# we will change the type of 'day' from int to str
campaign['day'] = campaign['day'].astype(str)

# Add last_contact_date column
campaign["last_contact_date"] = campaign["year"] + "-" + campaign["month"] + "-" + campaign["day"]

# Convert to datetime
campaign["last_contact_date"] = pd.to_datetime(campaign["last_contact_date"], 
                                               format="%Y-%b-%d")

# Drop unneccessary columns
campaign.drop(columns=["month", "day", "year"], inplace=True)

# Save tables to individual csv files
client.to_csv("client.csv", index=False)
campaign.to_csv("campaign.csv", index=False)
economics.to_csv("economics.csv", index=False)

Column	Data Type	Description	Cleaning Requirements
`client_id`	integer	Client ID	No changes needed
`age`	integer	Client's age in years	No changes needed
`job`	object	Client's type of job	Replace `"."` with `"_"`
`marital`	object	Client's marital status	No changes needed
`education`	object	Client's level of education	Replace `"."` with `"_"` and `"unknown"` with `np.NaN`
`credit_default`	bool	Whether the client’s credit is in default	Convert to boolean: `1` if `"yes"`, otherwise `0`
`mortgage`	bool	Whether the client has a mortgage	Convert to boolean: `1` if `"yes"`, otherwise `0`

Column	Data Type	Description	Cleaning Requirements
`client_id`	integer	Client ID	No changes needed
`number_contacts`	integer	Number of contact attempts in current campaign	No changes needed
`contact_duration`	integer	Duration of last contact in seconds	No changes needed
`previous_campaign_contacts`	integer	Number of contact attempts in previous campaign	No changes needed
`previous_outcome`	bool	Outcome of previous campaign	Convert to boolean: `1` if `"success"`, otherwise `0`
`campaign_outcome`	bool	Outcome of current campaign	Convert to boolean: `1` if `"yes"`, otherwise `0`
`last_contact_date`	datetime	Last date client was contacted	Create from combining `day`, `month`, and a new `year` column with value `2022` Format as `"YYYY-MM-DD"`

Column	Data Type	Description	Cleaning Requirements
`client_id`	integer	Client ID	No changes needed
`cons_price_idx`	float	Consumer price index (monthly indicator)	No changes needed
`euribor_three_months`	float	Euro Interbank Offered Rate three-month rate (daily indicator)	No changes needed

Project Description¶

`client.csv`¶

`campaign.csv`¶

`economics.csv`¶

We will split and tidy bank_marketing.csv, storing as three DataFrames called client, campaign, and economics, each containing the columns outlined in the notebook and formatted to the data types listed.¶

Beginng with client data¶

Since tha client data has been cleaned, we will move on to Campaign data¶

Now our data is cleaned, we splitted the data into three datframes. We can now save our resultant files individually.¶

	client_id	age	job	marital	education	credit_default	mortgage
0	0	56	housemaid	married	basic.4y	no	no
1	1	57	services	married	high.school	unknown	no
2	2	37	services	married	high.school	no	yes
3	3	40	admin.	married	basic.6y	no	no
4	4	56	services	married	high.school	no	no
...	...	...	...	...	...	...	...
41183	41183	73	retired	married	professional.course	no	yes
41184	41184	46	blue-collar	married	professional.course	no	no
41185	41185	56	retired	married	university.degree	no	yes
41186	41186	44	technician	married	professional.course	no	no
41187	41187	74	retired	married	professional.course	no	yes

	client_id	number_contacts	month	day	contact_duration	previous_campaign_contacts	previous_outcome	campaign_outcome
0	0	1	may	13	261	0	NaN	NaN
1	1	1	may	19	149	0	NaN	NaN
2	2	1	may	23	226	0	NaN	NaN
3	3	1	may	27	151	0	NaN	NaN
4	4	1	may	3	307	0	NaN	NaN
...	...	...	...	...	...	...	...	...
41183	41183	1	nov	30	334	0	NaN	NaN
41184	41184	1	nov	6	383	0	NaN	NaN
41185	41185	2	nov	24	189	0	NaN	NaN
41186	41186	1	nov	17	442	0	NaN	NaN
41187	41187	3	nov	23	239	1	NaN	NaN

Project Description¶

client.csv¶

campaign.csv¶

economics.csv¶

We will split and tidy bank_marketing.csv, storing as three DataFrames called client, campaign, and economics, each containing the columns outlined in the notebook and formatted to the data types listed.¶

Beginng with client data¶

Since tha client data has been cleaned, we will move on to Campaign data¶

Now our data is cleaned, we splitted the data into three datframes. We can now save our resultant files individually.¶

`client.csv`¶

`campaign.csv`¶

`economics.csv`¶